Optimal Feature Subset Selection Using Similarity-Dissimilarity Index and Genetic Algorithms
Abstract
Optimal feature subset selection is an important pre-processing step for classification in many real-life problems where the feature space has a large number of dimensions and some features may be irrelevant or redundant. One example is gene expression profile data used to classify samples as normal or cancerous. The contribution of this paper is fivefold. First, a similarity-dissimilarity index (MSDI) is proposed that estimates the class discrimination quality of a high-dimensional feature space without using any classifier. Second, a framework is proposed that uses a genetic algorithm, with MSDI as the fitness function for evolving the population, to find the best feature subset in the n-dimensional feature space, selecting the smallest possible set of important features. Third, a similarity-dissimilarity plot is proposed for visualizing high-dimensional data; it can be used to extract important information about the class discrimination quality of the feature space. Fourth, MSDI makes it possible to predict the best classification accuracy attainable when an appropriate classifier is used. Fifth, another index, the average differential of similarity and dissimilarity distances above the similarity-dissimilarity line, is proposed; it indicates how far the instances or clusters of each class are from other classes and how compact the classes are in the feature space. The effectiveness of the methods is demonstrated on a large set of benchmark cancer classification datasets, and the size of the feature subsets and the predicted classification accuracy are compared with published results.
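The framework described in the abstract, a genetic algorithm that evolves feature masks under a classifier-free fitness function, can be sketched roughly as follows. The abstract does not give the formula for MSDI, so this sketch swaps in a hypothetical stand-in score, `class_separation` (mean between-class distance minus mean within-class distance); it is not the paper's index. The function names, the toy data, the size penalty, and all GA settings (truncation selection, one-point crossover, bit-flip mutation) are likewise assumptions chosen for illustration.

```python
import math
import random


def class_separation(X, y, mask):
    """Hypothetical stand-in for MSDI (NOT the paper's formula): mean
    between-class distance minus mean within-class distance, computed
    only on the features switched on in `mask`. Assumes at least two
    classes and at least one class with two or more samples."""
    cols = [j for j, bit in enumerate(mask) if bit]
    if not cols:
        return float("-inf")  # an empty subset cannot separate anything
    within, between = [], []
    for a in range(len(X)):
        for b in range(a + 1, len(X)):
            # Root-mean-square distance, so subsets of different sizes
            # are scored on a comparable scale.
            d = math.sqrt(sum((X[a][j] - X[b][j]) ** 2 for j in cols) / len(cols))
            (within if y[a] == y[b] else between).append(d)
    return sum(between) / len(between) - sum(within) / len(within)


def select_features(X, y, pop_size=20, generations=30,
                    mutation_rate=0.1, size_penalty=0.01, seed=0):
    """Evolve binary feature masks; returns the best mask found."""
    rng = random.Random(seed)
    n = len(X[0])  # number of features, assumed to be at least 2

    def fitness(mask):
        # Assumed penalty term that favours smaller subsets.
        return class_separation(X, y, mask) - size_penalty * sum(mask)

    population = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]  # truncation selection, elites survive
        children = []
        while len(parents) + len(children) < pop_size:
            p1, p2 = rng.sample(parents, 2)
            cut = rng.randrange(1, n)  # one-point crossover
            child = p1[:cut] + p2[cut:]
            # Bit-flip mutation, applied independently to each gene.
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        population = parents + children
    return max(population, key=fitness)


def make_toy_data(seed=1):
    """Synthetic data: feature 0 separates the two classes, features 1-5 are noise."""
    rng = random.Random(seed)
    X, y = [], []
    for label in (0, 1):
        for _ in range(10):
            informative = label * 5.0 + rng.gauss(0, 0.3)
            X.append([informative] + [rng.gauss(0, 1.0) for _ in range(5)])
            y.append(label)
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data()
    best = select_features(X, y)
    print("selected mask:", best)
    print("separation score:", class_separation(X, y, best))
```

Because the fitness never trains a classifier, each evaluation costs only pairwise distance computations, which is the property the abstract attributes to MSDI. Replacing `class_separation` with the actual index would leave the rest of the loop unchanged.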
Similar resources
A Parallel Genetic Algorithm Based Method for Feature Subset Selection in Intrusion Detection Systems
Intrusion detection systems are designed to provide security in computer networks, so that if an attacker gets past other security devices, they can detect and prevent the attack process. One of the most essential challenges in designing these systems is the so-called curse of dimensionality. Therefore, in order to obtain satisfactory performance in these systems we have to take advantage of app...
Improvement of effort estimation accuracy in software projects using a feature selection approach
In recent years, the use of feature selection techniques has become an essential requirement for processing and model construction in different scientific areas. In the field of software project effort estimation, the need to apply dimensionality reduction and feature selection methods has become an inevitable demand. The high volumes of data, costs, and time necessary for gathering data, ...
Sequential and Mixed Genetic Algorithm and Learning Automata (SGALA, MGALA) for Feature Selection in QSAR
Feature selection is of great importance in Quantitative Structure-Activity Relationship (QSAR) analysis. This problem has been solved using meta-heuristic algorithms such as GA, PSO, ACO, and SA. In this work, two novel hybrid meta-heuristic algorithms, Sequential GA and LA (SGALA) and Mixed GA and LA (MGALA), which are based on the genetic algorithm and learning automata for QSAR f...
Online Streaming Feature Selection Using Geometric Series of the Adjacency Matrix of Features
Feature Selection (FS) is an important pre-processing step in machine learning and data mining. All the traditional feature selection methods assume that the entire feature space is available from the beginning. However, online streaming features (OSF) are an integral part of many real-world applications. In OSF, the number of training examples is fixed while the number of features grows with t...
Publication date: 2015